AMRflows Metagenomic Data Analysis Course

13. Genome binning

Genome binning is a common step in metagenomic analysis and can be thought of as an extension of genome assembly. In this process, we group assembled contigs into individual genome bins in order to reconstruct metagenome-assembled genomes (MAGs). Genome binning tools use a variety of approaches to group these contigs, including sequence composition-based binning, sequencing depth and coverage-based binning, machine learning-based methods, and hybrid methods that combine multiple information sources. Each of these methods has its own pros and cons, and the choice of which to use often depends on the sample and the experiment. A detailed review of these binning approaches can be found here.
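To make the idea of composition-based binning more concrete, here is a minimal sketch in Python. It assumes that a contig can be summarised by its tetranucleotide (4-mer) frequencies and that contigs from the same genome tend to have similar profiles. The contig sequences and function names are hypothetical and chosen only for illustration; real binning tools use more sophisticated models and usually combine composition with coverage.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every DNA k-mer in `sequence`."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G and T are counted, so windows with N are ignored
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def profile_distance(a, b):
    """Euclidean distance between two k-mer frequency profiles."""
    return sum((a[kmer] - b[kmer]) ** 2 for kmer in a) ** 0.5


# Hypothetical toy contigs: two AT-rich sequences and one GC-rich sequence
contigs = {
    "contig_1": "AATTAATTAATTAATTAATT",
    "contig_2": "TTAATTAATTAATTAATTAA",
    "contig_3": "GGCCGGCCGGCCGGCCGGCC",
}

profiles = {name: kmer_frequencies(seq) for name, seq in contigs.items()}
d12 = profile_distance(profiles["contig_1"], profiles["contig_2"])
d13 = profile_distance(profiles["contig_1"], profiles["contig_3"])
print(f"contig_1 vs contig_2: {d12:.3f}")
print(f"contig_1 vs contig_3: {d13:.3f}")
```

In this toy example the two AT-rich contigs end up much closer to each other than either is to the GC-rich contig, which is the signal a composition-based binner would use to place them in the same bin.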

Binning with MetaBat2

In this course we will be using MetaBat2 as our genome binning tool of choice. The tool is available through conda, but that installation route is not recommended or supported by the MetaBat developers. The installation instructions on the tool's Bitbucket page are quite clear (follow the non-Docker installation unless you are already familiar with Docker or want to experiment).

MetaBat2 requires two input files: the assembled contigs we generated with MEGAHIT, and a sorted BAM file containing the read coverage data for each contig. The BAM file can be generated by mapping the reads that we used for the assembly back to the assembled contigs with a mapping tool such as BWA. BWA needs an index of the contigs before it can align reads to them, so install BWA via conda, build the index, and then run the alignment:

bwa index final.contigs.fa
bwa mem final.contigs.fa sample1_R1.fq sample1_R2.fq > sample1_aln.sam

This generates a SAM file, but MetaBat2 requires a sorted BAM file. SAM is a human-readable, tab-delimited text format for alignments, and BAM is its compressed binary equivalent. We can convert a SAM file to a sorted BAM file with SAMtools, which can also be installed via conda:

samtools sort sample1_aln.sam -o sample1_aln_sorted.bam

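Now that we have both our assembled contigs and a sorted BAM file, we can run MetaBat2 to get our MAGs. Look at the MetaBat2 Bitbucket page and try to run MetaBat2 using our two input files and an output directory named sample1_metabat2_out, whilst keeping other options as default. Note that the runMetaBat.sh wrapper accepts the BAM file directly, whereas the metabat2 binary expects a depth file produced from the BAM file with jgi_summarize_bam_contig_depths. MetaBat2 will then generate one FASTA file per bin; unbinned contigs are only written out if you add the --unbinned option.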

Evaluating Quality of Metagenomic Bins with CheckM

CheckM is a powerful tool designed to assess the quality of MAGs. It estimates genome completeness and contamination using lineage-specific marker genes that are ubiquitous and single-copy within a phylogenetic lineage. Markers that are missing from a bin point to an incomplete genome, while markers present in more than one copy point to contamination. These estimates help you decide whether the assembled genomes are reliable enough for downstream analyses.
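The logic behind marker-based estimates can be sketched in a few lines of Python. This is a deliberately simplified illustration, not CheckM's actual algorithm: CheckM groups markers into collocated sets and weights them accordingly, whereas this sketch treats every marker independently. The function name and the marker gene names are hypothetical.

```python
def estimate_quality(marker_counts):
    """Estimate completeness and contamination (both in %) for one bin.

    `marker_counts` maps each expected single-copy marker gene to the
    number of copies found in the bin.
    """
    n_markers = len(marker_counts)
    if n_markers == 0:
        raise ValueError("at least one marker gene is required")

    # A marker counts towards completeness if it is found at least once
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    # Every copy beyond the first is treated as a sign of contamination
    extra_copies = sum(copies - 1 for copies in marker_counts.values() if copies > 1)

    completeness = 100.0 * present / n_markers
    contamination = 100.0 * extra_copies / n_markers
    return completeness, contamination


# Hypothetical marker counts for a single bin (names are placeholders)
counts = {"marker_A": 1, "marker_B": 1, "marker_C": 0, "marker_D": 2, "marker_E": 1}
completeness, contamination = estimate_quality(counts)
print(f"completeness={completeness:.1f}% contamination={contamination:.1f}%")
```

Here four of the five expected markers are present, and one marker appears twice, so the sketch reports a bin that is both incomplete and contaminated.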

CheckM can be easily installed via conda. Reference data is required for CheckM to function, and details on how to download and prepare the data can be found on CheckM’s wiki. The standard workflow for CheckM is lineage_wf. By default, CheckM looks for bin files with the .fna extension, whereas MetaBat2 writes its bins as .fa files, so pass the extension with -x:

checkm lineage_wf -x fa <bin folder> <output folder>

CheckM provides several important functionalities for assessing genome quality, including:

* Estimation of Genome Completeness
* Estimation of Genome Contamination
* Identification of Potential Misassemblies

You should read the extensive CheckM wiki for details on how these metrics are calculated and what they mean. However, there are a few particular things to look out for in the CheckM output:

Low Completeness Scores: Bins with completeness scores below 70% may indicate that significant portions of the genome are missing.
High Contamination Scores: Contamination scores above 5% are concerning, as they suggest that the bin may contain DNA from multiple organisms, complicating downstream analyses.
Presence of Duplicate Marker Genes: If marker genes that are expected to be single-copy appear as duplicates, it may indicate contamination or misassemblies.
Inconsistent Lineage Assignments: If a genome bin is classified with a broad marker set (e.g., domain-level) rather than a specific lineage, it may indicate uncertainty in the classification and quality of the genome.
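The completeness and contamination checks above can be applied to a whole results table at once. The following Python sketch assumes that the CheckM results have been saved as a tab-separated table (for example with the --tab_table option) with columns named "Bin Id", "Completeness" and "Contamination". The function name and the example values are hypothetical, and the column names should be checked against your own output.

```python
import csv
import io


def flag_bins(table_text, min_completeness=70.0, max_contamination=5.0):
    """Return {bin_id: [reasons]} for bins that fail either threshold."""
    flagged = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        reasons = []
        if completeness < min_completeness:
            reasons.append(f"completeness {completeness:.1f}% is below {min_completeness:.1f}%")
        if contamination > max_contamination:
            reasons.append(f"contamination {contamination:.1f}% is above {max_contamination:.1f}%")
        if reasons:
            flagged[row["Bin Id"]] = reasons
    return flagged


# Hypothetical, abbreviated results table (values invented for illustration)
example_table = (
    "Bin Id\tCompleteness\tContamination\n"
    "bin.1\t95.2\t1.3\n"
    "bin.2\t62.0\t0.8\n"
    "bin.3\t88.4\t12.5\n"
)

for bin_id, reasons in flag_bins(example_table).items():
    print(bin_id, "; ".join(reasons))
```

The thresholds are ordinary arguments, so a looser or stricter cut-off can be tried without changing the function, for example flag_bins(example_table, min_completeness=50.0, max_contamination=10.0). To use a real file, read its contents into a string first and pass that string as table_text.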

Decisions on what to do with bins are left up to the researcher, but in my own pipelines I typically remove bins with less than 50% completeness or more than 10% contamination. This keeps the average quality of my bins high without discarding the majority of them.

In summary, genome binning is a vital step in metagenomic analysis that allows for the reconstruction of individual genomes from complex samples. Using tools like MetaBat2 and CheckM, researchers can effectively group contigs and assess the quality of their assembled genomes.